In the last few years we have witnessed the emergence, primarily in on-line communities, of new
types of social networks that require for their representation more complex graph structures than 
have been employed in the past. One example is the folksonomy, a tripartite structure of users, 
resources, and tags — labels collaboratively applied by the users to the resources in order to impart 
meaningful structure on an otherwise undifferentiated database. Here we propose a mathematical 
model of such tripartite structures which represents them as random hypergraphs. We show that it 
is possible to calculate many properties of this model exactly in the limit of large network size and 
we compare the results against observations of a real folksonomy, that of the on-line photography 
web site Flickr. We show that in some cases the model matches the properties of the observed 
network well, while in others there are significant differences, which we find to be attributable to 
the practice of multiple tagging, i.e., the application by a single user of many tags to one resource, 
or one tag to many resources. 



I. INTRODUCTION 

Networks are a versatile mathematical tool for rep- 
resenting the structure of complex systems and have 
been the subject of a large volume of work in the last few
years. In its simplest form a network consists
of a set of nodes or vertices, connected by lines or
edges, but many extensions and generalizations have also 
been studied, including networks with directed edges, 
networks with labeled or weighted edges or vertices, and 
bipartite networks, which have two types of vertices and 
edges running only between unlike types. 

Recently, however, new and more complex types of net- 
work data have become available, especially associated 
with on-line social and professional communities, that 
cannot adequately be described by existing network for- 
mats. One example is the folksonomy. "Folksonomy" is 
the name given to the common on-line (and sometimes 
off-line) process by which a group of individuals collabo- 
ratively annotate a data set to create semantic structure. 
Typically mark-up is performed by labeling pieces of data 
with tags. A good example is provided by the on-line 
photography resource Flickr, a web site to which users 
upload photographs that can then be viewed by other 
users. Flickr allows any user to give a short description 
of any photo they see, usually just a single word or a few 
words. These are the tags. In principle, tags can allow 
users to do many things, such as searching for photos 
with particular subjects or clustering photos into topical 
groups. There are also many other websites and on-line 
resources with similar tagging capabilities, but dealing 
with different resources. On the website CiteULike, for
example, users upload academic papers rather than
photographs and label them with descriptive tags.

Researchers have taken a variety of approaches to the 
representation of folksonomy data using network meth- 
ods, including modeling them as simple unipartite graphs 
and bipartite graphs as well as limited forms of tripartite 
graphs. Each of these approaches, however,
fails to capture some elements of the structure of the data 
and hence limits the conclusions that can be drawn from 
subsequent network analysis. 

The fundamental building block in a folksonomy is a 
triple consisting of a resource, such as a photograph, a 
tag, usually a short text phrase, and a user, who applies 
the tag to the resource. Any full network representation 
of folksonomy data needs to capture this three-way rela- 
tionship between resource, tag, and user, and this leads 
us to the consideration of hypergraphs. 

A hypergraph is a generalization of an ordinary graph 
in which an edge (or hyperedge) can connect more than 
two vertices together. To represent our folksonomy we 
make use of a tripartite hypergraph, a generalization of the 
more familiar bipartite graph, in which there are three 
types of vertices representing resources, tags, and users, 
and three-way hyperedges joining them in such a way 
that each hyperedge links together exactly one resource, 
one tag, and one user. Each hyperedge corresponds to 
the act of a user applying a tag to a resource and hence 
the tripartite hypergraph preserves the full structure of 
the folksonomy (see Fig. 1).

In this paper, we study the theory of such tripartite 
graphs, starting with basic network properties such as 
degree distributions and then developing a random graph 
model that allows us to make analytic predictions of a 
variety of network properties. We test our predictions by
comparing them with data from the Flickr folksonomy
and find good agreement in some, but not all, cases.

FIG. 1: Vertices in our networks come in three types, represented
here by the red circles, green diamonds, and blue
squares, and are connected by three-way hyperedges that each
join together exactly one circle, one diamond, and one square.
In the language of folksonomies, the circles represent, say, the
resources, the diamonds the tags, and the squares the users.



II. TRIPARTITE GRAPHS 

We begin our study of tripartite hypergraphs by outlin- 
ing some of the basic properties of such networks. Our 
tripartite graphs have three different types of vertices, 
which, to preserve generality, we will refer to as red, 
green, and blue vertices. (In this paper, when discussing 
applications of the theory to folksonomies, red will represent
resources, green tags, and blue users, but the theory
itself is entirely agnostic about what the colors represent.)
Let us suppose that there are $n_r$ red vertices,
$n_g$ green ones, and $n_b$ blue ones.

The edges in our network are three-way hyperedges 
that each connect one red, one green, and one blue ver- 
tex. (We might say that the hyperedges are "colorless" 
or "white," since red, green, and blue make white when 
combined in the human visual system.) Let us suppose 
there to be m hyperedges in total. 

There are a number of ways in which vertex degree 
can be defined for a hypergraph. Some authors, for in- 
stance, have defined degree as the total number of other 
vertices to which a given vertex is connected by hyper- 
edges. This corresponds to the definition of degree in an 
ordinary graph (at least when there are no multiedges
or self-edges), but in failing to distinguish between the 
different types of vertices to which hyperedges are con- 
nected, it can lead to confusion in the hypergraph case. 
The best, and also simplest, definition of degree for a ver- 
tex in a hypergraph is simply the number of hyperedges 
attached to that vertex. Thus a red vertex participating 
in four hyperedges has degree four. This might mean that 
it has four green and four blue neighbors in the network, 
but it is also possible that some neighboring vertices are 
common to more than one hyperedge, in which case the 
number of neighboring vertices of a given color may be 
smaller than four. 



The mean degree $c_r$ of a red vertex in our network is
given by the number of hyperedges in the network divided
by the number of red vertices, and similarly for green and
blue:

$$ c_r = \frac{m}{n_r}, \qquad c_g = \frac{m}{n_g}, \qquad c_b = \frac{m}{n_b}. \tag{1} $$

Rearranging these equations to give three separate expressions
for $m$, we also have

$$ m = n_r c_r = n_g c_g = n_b c_b. \tag{2} $$



Thus the mean degrees of the different vertex types can- 
not be chosen independently, but are linked via the fact 
that the same hyperedges connect to the red, green and 
blue vertices. 

One of the most important parameters of a network is 
its degree distribution. Just as bipartite networks have 
two distinct degree distributions, our tripartite ones have 
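three: we define $p_r(k)$ to be the fraction of red vertices in
the network that have degree $k$, and $p_g(k)$ and $p_b(k)$ to be
the corresponding quantities for green and blue vertices.
These distributions satisfy the sum rules

$$ \sum_{k=0}^{\infty} p_r(k) = \sum_{k=0}^{\infty} p_g(k) = \sum_{k=0}^{\infty} p_b(k) = 1 \tag{3} $$

and

$$ \sum_{k=0}^{\infty} k\,p_r(k) = c_r, \qquad \sum_{k=0}^{\infty} k\,p_g(k) = c_g, \qquad \sum_{k=0}^{\infty} k\,p_b(k) = c_b. \tag{4} $$

As with bipartite graphs, it is sometimes convenient to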
form "projections" of tripartite graphs onto a subset of 
their vertices. In a bipartite graph of red and green ver- 
tices, for instance, one forms a projection onto the red 
vertices alone by constructing the network of red ver- 
tices in which vertices are connected by an edge if they 
share a common green neighbor in the original bipartite 
graph.

While for bipartite graphs there is essentially only
one way of performing projections, there are several distinct
possibilities for tripartite graphs (see Fig. 2). One
can again join two red vertices if they share a green 
neighbor — in our Flickr example from the introduction, 
two photos would be connected if they have a tag in com- 
mon. Or one can join two red vertices that share a com- 
mon blue neighbor — two photos that were tagged by the 
same user. Or one could join vertices that share either 
a green or a blue neighbor. And of course one can de- 
fine the equivalent projections onto the green and blue 
vertices. 

But it doesn't stop there. In a tripartite network, one 
can also form projections onto two of the colors. For in- 
stance, one can form a projected bipartite network of red 
and green vertices, in which a red and a green vertex are 
connected by an ordinary edge if they were connected by 
a hyperedge in the original network. Thus one can create
a network of, for example, photos and the tags applied
to them, while dropping information about which users
applied which tags. And again one can also construct
the equivalent projections onto red/blue and blue/green
vertex combinations. Alternatively, one can construct a
red/green network by connecting any pair of vertices,
of different colors or not, if they share a common blue
neighbor. Thus a tag would be connected to a photo if
any user applied that tag to that photo, but tags would
also be connected to other tags that were used by the
same user.

FIG. 2: Ways of projecting a tripartite graph onto one of its
vertex types (red in this case). Red vertices in the projected
graph can be connected if they share a green neighbor (green
edges in the projected graph), a blue neighbor (blue edges),
or a neighbor of either kind (all edges together).

Many other standard concepts in the theory of networks
can be generalized to tripartite graphs, including
clustering coefficients, correlations between the degrees
of adjacent vertices (including three-point correlations),
community structure and modularity, motif counts, and
more. The concepts introduced above, however, will be
sufficient for our purposes in this paper.



III. RANDOM TRIPARTITE GRAPHS

In theoretical studies of networks, random graph
models have received particular emphasis because they
capture many of the essential properties of networked
systems in the real world while simultaneously being
amenable to analytic treatment. A variety of random
graph models have been studied, from models of simple
undirected or directed graphs to more complicated examples
with correlations, communities, or bipartite structure.
In this section we develop the
theory of random tripartite hypergraphs with given degree
distributions, which turn out to model many of the
properties of real tripartite graphs quite effectively.



A. The model

Consider a model hypergraph with $n_r$ red vertices,
$n_g$ green vertices, and $n_b$ blue vertices. Each vertex is
assigned a degree, corresponding to the number of hyperedges
it will have. These degrees can be visualized as
"stubs" of hyperedges emerging from each vertex in the
appropriate numbers. The degrees must satisfy Eq. (2),
so that the total number of stubs emerging from vertices
of each color is the same and equal to the total desired
number of hyperedges $m$.

A total of m three-way hyperedges are now created by 
choosing trios of stubs uniformly at random, one each 
from a red, green, and blue vertex, and connecting them 
to form hyperedges. This model is the equivalent for our 
tripartite graph of the so-called "configuration model" 
for unipartite graphs [12] and the random bipartite graph
model for bipartite graphs.

Given the definition of the model, we can, for example, 
calculate the probability that a hyperedge exists between 
a given trio of vertices i, j, k. In the process of creating 
a single hyperedge, the probability that we will choose a 
specific stub attached to red vertex i is 1/m, since there 
are a total of m stubs attached to red vertices and we 
choose uniformly among them. If $i$ has degree $k_i$ then
the total probability of choosing a stub from vertex $i$ is
$k_i/m$. Similarly the probabilities of choosing stubs from
green and blue vertices $j$ and $k$ are $k_j/m$ and $k_k/m$.
Given that there are $m$ hyperedges in total, the overall
probability of a hyperedge between $i$, $j$, and $k$ is then

$$ m \times \frac{k_i}{m} \times \frac{k_j}{m} \times \frac{k_k}{m} = \frac{k_i k_j k_k}{m^2}. \tag{5} $$



Via a similar argument, the probability that there is a 
hyperedge connecting a particular red/green pair i,j (or 
any other color combination) is $k_i k_j/m$. Note that in
a sparse graph in which the typical degrees remain constant
as the size of the graph increases, both of these
probabilities vanish at least as fast as $1/m$. Among other things, this
implies that the chance of occurrence of small loops in 
the network vanishes in the limit of large graph size. In 
the language of graph theory, one says that the network 
is locally tree-like, a property that will be important in 
the developments to follow. 

Rather than specifying the degree of every vertex in 
the network, we can alternatively specify just the degree 
distributions $p_r(k)$, $p_g(k)$, and $p_b(k)$ of the three vertex
types (constrained to satisfy the sum rules (3) and (4)),
then draw a specific sequence of degrees from those distributions
and connect the vertices as before. As a practical
matter, if one wanted to generate actual example
networks on a computer, one would need to ensure that
the degrees satisfied Eq. (2), which in general they will
not on first being drawn from the distributions. A simple
strategy for ensuring that they do is first to draw a
complete set of degrees and then repeatedly choose at 
random a trio of vertices, one of each color, discard the 
current values of their degrees, and redraw them from the 
appropriate distributions until the constraint is satisfied. 
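The procedure described above (draw degrees, redraw random trios until the three degree sums agree, then match stubs at random) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Poisson degree distributions, the helper names `poisson_sampler`, `balanced_degree_sequences`, and `wire_hyperedges`, and the `max_tries` cutoff are hypothetical choices made for the example and are not specified by the model itself.

```python
import math
import random


def poisson_sampler(mean):
    """Return a function drawing Poisson(mean) integers (Knuth's method).

    The Poisson choice is an assumption made for illustration; any sampler
    returning non-negative integers could be substituted.
    """
    limit = math.exp(-mean)

    def sample(rng):
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        return k

    return sample


def balanced_degree_sequences(sizes, samplers, rng, max_tries=1_000_000):
    """Draw degrees for the (red, green, blue) vertices, then redraw the
    degrees of random trios until the three degree sums agree, Eq. (2)."""
    degs = [[s(rng) for _ in range(n)] for n, s in zip(sizes, samplers)]
    sums = [sum(d) for d in degs]
    for _ in range(max_tries):
        if sums[0] == sums[1] == sums[2]:
            return degs
        # Pick one vertex of each color and redraw its degree.
        for c in range(3):
            i = rng.randrange(sizes[c])
            new = samplers[c](rng)
            sums[c] += new - degs[c][i]
            degs[c][i] = new
    raise RuntimeError("degree sums did not balance; check the mean degrees")


def wire_hyperedges(degs, rng):
    """Match stubs uniformly at random into (red, green, blue) hyperedges."""
    stub_lists = []
    for color_degs in degs:
        # One stub per unit of degree, labeled by the vertex it belongs to.
        stubs = [v for v, k in enumerate(color_degs) for _ in range(k)]
        rng.shuffle(stubs)
        stub_lists.append(stubs)
    return list(zip(*stub_lists))


rng = random.Random(0)
sizes = (30, 20, 60)        # n_r, n_g, n_b
means = (2.0, 3.0, 1.0)     # c_r, c_g, c_b, chosen so that n * c = 60 for each color
degs = balanced_degree_sequences(sizes, [poisson_sampler(c) for c in means], rng)
edges = wire_hyperedges(degs, rng)
print(len(edges), "hyperedges")
```

Shuffling each stub list independently and zipping the three lists together gives a uniformly random matching of stubs into trios; multiple hyperedges joining the same trio of vertices are possible, as in the configuration model for ordinary graphs.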

The degree distributions represent the probability that 
a vertex of a given color chosen at random from the entire 
network has a given degree. If we choose a hyperedge at 
random, however, and follow it to the red, green, or blue 
vertex at one of its corners, that vertex will not have 
degree distributed according to $p_r(k)$, $p_g(k)$, or $p_b(k)$,
and the reason is easy to see: vertices with many hyperedges
are proportionately more likely to be encountered
when following edges. A vertex of degree ten, for
instance, has ten times as many chances to be chosen
in this way as a similarly colored vertex of degree
one. (And a vertex of degree zero will never be chosen
at all.) Thus the distribution of degrees of vertices encountered
is proportional to $k p_r(k)$ for red vertices, and
similarly for green and blue. Requiring this distribution
to sum to unity, the correctly normalized distribution is
$k p_r(k) / \sum_k k p_r(k) = k p_r(k)/c_r$.

As in other random graph models, we are in fact usu- 
ally interested not in the degree of the vertex we en- 
counter but in the number of hyperedges attached to it 
other than the one we followed to reach it. This so-called 
excess degree, which is 1 less than the total degree, has 
the same distribution as above, but with the replacement 
$k \to k + 1$, giving an excess degree distribution of

$$ q_r(k) = \frac{(k+1)\,p_r(k+1)}{c_r}, \tag{6} $$

and similarly for other vertex colors.
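As a small numerical illustration of the excess degree distribution, the following Python sketch converts a degree distribution into the corresponding excess degree distribution using the normalized form $k\,p(k)/c$ with the shift $k \to k+1$ described above. The function name and the example distribution are hypothetical.

```python
def excess_degree_distribution(p):
    """Given p[k] = fraction of vertices with degree k, return the
    excess degree distribution q[k] = (k + 1) * p[k + 1] / c,
    where c = sum_k k * p[k] is the mean degree."""
    c = sum(k * pk for k, pk in enumerate(p))
    if c <= 0:
        raise ValueError("mean degree must be positive")
    return [(k + 1) * p[k + 1] / c for k in range(len(p) - 1)]


p_r = [0.2, 0.5, 0.3]    # hypothetical degree distribution for red vertices
q_r = excess_degree_distribution(p_r)
print(q_r, sum(q_r))
```

Note that degree-zero vertices drop out entirely, and that the result is automatically normalized because the sum of $(k+1)\,p(k+1)$ over $k$ equals the mean degree.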



B. Generating functions 

The fundamental tools we will use in calculating the 
properties of the random tripartite graph are probability 
generating functions. We begin by defining generating 
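functions for the degree distributions thus:

$$ r_0(z) = \sum_{k=0}^{\infty} p_r(k)\,z^k, \qquad g_0(z) = \sum_{k=0}^{\infty} p_g(k)\,z^k, \qquad b_0(z) = \sum_{k=0}^{\infty} p_b(k)\,z^k. \tag{7} $$

Given these generating functions we can, for instance,
easily calculate the means of the distributions: $c_r = r_0'(1)$
and so forth. Higher moments are also straightforward.
We also define corresponding generating functions for
the excess degree distributions:

$$ r_1(z) = \sum_{k=0}^{\infty} q_r(k)\,z^k = \frac{r_0'(z)}{c_r}, \qquad g_1(z) = \sum_{k=0}^{\infty} q_g(k)\,z^k = \frac{g_0'(z)}{c_g}, \qquad b_1(z) = \sum_{k=0}^{\infty} q_b(k)\,z^k = \frac{b_0'(z)}{c_b}. \tag{8} $$



C. Projections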

As a first example, we use our generating functions to 
calculate the degree distribution for the projection of a 
tripartite random graph onto one of its vertex types, as 
described in Section II. Consider first the projection onto
(say) red vertices in which two red vertices are joined by 
an edge if they share a green neighbor. (The blue vertices 
are ignored in this projection.) 

Suppose a given red vertex A has s green neighbors 
and each of those green neighbors has t red neighbors 
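other than vertex A. Given that $s$ is distributed according
to $p_r(s)$ and $t$ is distributed according to $q_g(t)$, the
probability $p_g(k)$ that A has exactly $k$ neighbors in the
projected network is

$$ p_g(k) = \sum_{s=0}^{\infty} p_r(s) \sum_{t_1=0}^{\infty} \cdots \sum_{t_s=0}^{\infty} q_g(t_1) \cdots q_g(t_s)\; \delta\Bigl(k, \sum_{n=1}^{s} t_n\Bigr), \tag{9} $$

where $\delta(i,j)$ is the Kronecker delta. Multiplying both
sides by $z^k$ and summing over $k$, the generating function
for this probability distribution is

$$ \begin{aligned}
\sum_{k=0}^{\infty} p_g(k)\,z^k
 &= \sum_{k=0}^{\infty} z^k \sum_{s=0}^{\infty} p_r(s) \sum_{t_1=0}^{\infty} \cdots \sum_{t_s=0}^{\infty} q_g(t_1) \cdots q_g(t_s)\; \delta\Bigl(k, \sum_{n=1}^{s} t_n\Bigr) \\
 &= \sum_{s=0}^{\infty} p_r(s) \sum_{t_1=0}^{\infty} \cdots \sum_{t_s=0}^{\infty} q_g(t_1) \cdots q_g(t_s)\; z^{\sum_n t_n} \\
 &= \sum_{s=0}^{\infty} p_r(s) \sum_{t_1=0}^{\infty} q_g(t_1)\,z^{t_1} \cdots \sum_{t_s=0}^{\infty} q_g(t_s)\,z^{t_s} \\
 &= \sum_{s=0}^{\infty} p_r(s) \Bigl[\sum_{t=0}^{\infty} q_g(t)\,z^t\Bigr]^s
  = \sum_{s=0}^{\infty} p_r(s)\,[g_1(z)]^s \\
 &= r_0(g_1(z)).
\end{aligned} \tag{10} $$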



We can also calculate the generating function for the 
projection in which two red vertices are connected by 
an edge if they share either a green or a blue neighbor. 



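The probability for a vertex to have $k$ neighbors in this
network is

$$ p_{gb}(k) = \sum_{s=0}^{\infty} p_r(s) \sum_{t_1=0}^{\infty} \cdots \sum_{t_s=0}^{\infty}\; \sum_{t'_1=0}^{\infty} \cdots \sum_{t'_s=0}^{\infty} q_g(t_1) \cdots q_g(t_s)\; q_b(t'_1) \cdots q_b(t'_s)\; \delta\Bigl(k, \sum_{n=1}^{s} (t_n + t'_n)\Bigr), \tag{11} $$

and the corresponding generating function is

$$ R_{gb}(z) = r_0\bigl(g_1(z)\,b_1(z)\bigr). \tag{12} $$

We can use this result to calculate, for instance, the
average degree in the projected network, which is given
by

$$ R_{gb}'(1) = r_0'(1)\,\bigl[b_1'(1) + g_1'(1)\bigr]. \tag{13} $$

We will also use it in Section IV to compare predictions
of the random graph model with real-world networks.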
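The compositions $r_0(g_1(z))$ and $r_0(g_1(z)\,b_1(z))$ can also be evaluated numerically by treating the generating functions as truncated polynomials. The sketch below is one way to do this in plain Python; the function names, the truncation order `kmax`, and the toy distributions are assumptions made for the example, and the result is exact only up to the chosen truncation.

```python
def convolve(a, b, kmax):
    """Coefficients of the product of two polynomials, truncated at z**kmax."""
    out = [0.0] * (kmax + 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            if i + j > kmax:
                break
            out[i + j] += ai * bj
    return out


def projected_degree_distribution(p_r, excess_dists, kmax):
    """Coefficients of r_0(f(z)), where f(z) is the product of the generating
    functions of the excess degree distributions in `excess_dists`.

    Passing [q_g] gives the green-neighbor projection r_0(g_1(z));
    passing [q_g, q_b] gives r_0(g_1(z) b_1(z)).
    """
    inner = [1.0]
    for q in excess_dists:
        inner = convolve(inner, q, kmax)
    result = [0.0] * (kmax + 1)
    power = [1.0]                      # inner(z) ** 0
    for ps in p_r:                     # accumulate sum_s p_r(s) * inner(z) ** s
        for k, coeff in enumerate(power):
            result[k] += ps * coeff
        power = convolve(power, inner, kmax)
    return result


# Toy example: every red vertex has degree 2.
p_r = [0.0, 0.0, 1.0]
q_g = [0.5, 0.5]
q_b = [0.25, 0.75]
dist = projected_degree_distribution(p_r, [q_g, q_b], kmax=4)
mean = sum(k * pk for k, pk in enumerate(dist))
print(dist, mean)
```

In this toy case the mean of the projected distribution can be checked by hand against the expression for the average projected degree: $r_0'(1) = 2$, $g_1'(1) = 0.5$, and $b_1'(1) = 0.75$ give $2 \times 1.25 = 2.5$.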



D. Formation and size of the giant component 

In this section we examine the component structure of 
our model network, focusing on the giant component. As 
with all networks, if our tripartite network is sufficiently 
sparse — if it has very few edges for the given number of 
vertices — then vertices will be connected together only 
in small groups or small components. If, however, the 
number of edges is sufficiently high, then a fraction of 
the vertices will join together into a single large group, 
the giant component, with the remainder in small compo- 
nents. There is a phase transition with increasing density 
at which the giant component forms that is closely anal- 
ogous to the phase transition in classical percolation. 

There is more than one possible definition of a compo- 
nent in our tripartite network, but the simplest approach 
is to define it as a set of vertices of any colors that are 
connected via hyperedges such that every vertex in the 
set is reachable from every other by some path through 
the network. Thus the collection of vertices depicted in 
the top panel of Fig. 2 constitutes a component in this
sense. 

FIG. 3: If a hyperedge (outlined in bold) is not to belong to
the giant component, then it must be that none of the hyperedges
reachable via, for instance, its red vertex are themselves
members of the giant component.

When viewed in the context of folksonomies, components,
and particularly the giant component, play an important
practical role. In a folksonomy such as that of
Flickr, the photography web site, users can "surf" between
photographs by traversing the hypergraph. A user
can, for example, click on the tag associated with a photo
and see a list of other photos with the same tag. Similarly 
a user can click on the name of another user and see a list 
of photos that user has tagged. The existence, or not, of 
a giant component in the network dictates whether this 
type of surfing is actually useful or not. If there is no 
giant component, then surfing users will find themselves 
restricted to the small set of photos, tags, and users in 
the component in which they start their surfing. But if 
there is a giant component then users will be able to surf 
to a significant fraction of all photos on the entire web 
site just by clicking on tags or users that seem interest- 
ing. The same considerations affect automated surfing 
by computerized "crawlers" that crawl web sites either 
to perform directed searches (so-called "spiders") or to 
create indexes for later search. If there is no giant com- 
ponent in the folksonomy, then it cannot be crawled in a 
useful way. 

We can calculate properties of the giant component 
in our tripartite random graph by methods similar to 
those used for ordinary random graphs [10]. Consider
a randomly chosen hyperedge in the full hypergraph, as
depicted in Fig. 3, and let us calculate the probability
that this hyperedge is not a part of the giant component.
We define $u_r$ to be the probability that the hyperedge is
not connected to the giant component via its red vertex,
and similarly for $u_g$ and $u_b$, so that the total probability
of not belonging to the giant component is $u_r u_g u_b$.

Suppose that the excess degree of the red vertex — the 
number of other hyperedges attached to it — is k. (In the 
example shown in Fig. 3 we have $k = 3$.) In order that
the hyperedge not be connected to the giant component
via the red vertex it must be that none of these other
hyperedges are connected to the giant component either.
Any one hyperedge satisfies this criterion with probability
$u_g u_b$ (the probability that neither of its other corners
leads to the giant component) and all $k$ of them together
do so with probability $(u_g u_b)^k$.

The excess degree is distributed according to the distribution
$q_r(k)$ defined in Eq. (6). Averaging over this
distribution, we then derive an expression for $u_r$ thus:

$$ u_r = \sum_{k=0}^{\infty} q_r(k)\,(u_g u_b)^k = r_1(u_g u_b). \tag{14} $$

Similarly we can show that

$$ u_g = g_1(u_b u_r), \qquad u_b = b_1(u_r u_g). \tag{15} $$



The simultaneous solution of these three equations for
$u_r$, $u_g$, and $u_b$ then allows us to calculate the probability
$1 - u_r u_g u_b$ that a randomly chosen hyperedge is in the
giant component. Alternatively, the probability that a
randomly chosen red vertex is not in the giant component
is the probability that none of its $k$ hyperedges lead to the
giant component, which is $\sum_k p_r(k)\,(u_g u_b)^k = r_0(u_g u_b)$,
so that a red vertex is in the giant component with
probability

$$ S_r = 1 - r_0(u_g u_b), \tag{16} $$

and we can write similar equations for $S_g$ and $S_b$. $S_r$ can
also be thought of as the fraction of red vertices in the
giant component, and hence is a measure of the size of
that component. The absolute number of red vertices in
the giant component is $n_r S_r$ and the number of vertices
of all colors is $n_r S_r + n_g S_g + n_b S_b$.

As in other random graph models, it is in most cases 
not possible to solve Eqs. (14) and (15) for $u_r$, $u_g$, and $u_b$
in closed form, but a numerical solution can be found 
easily by iteration starting from suitable initial values. 
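A minimal sketch of such an iteration is given below. It assumes Poisson degree distributions, for which the degree and excess degree generating functions coincide and equal $e^{c(z-1)}$, and it starts from $u_r = u_g = u_b = 0$ so that the iterates increase monotonically toward the smallest fixed point. The function names, tolerance, and iteration cap are hypothetical choices for the example.

```python
import math


def solve_giant_component(r1, g1, b1, tol=1e-12, max_iter=100_000):
    """Iterate u_r = r1(u_g u_b), u_g = g1(u_b u_r), u_b = b1(u_r u_g),
    i.e. Eqs. (14) and (15), starting from u_r = u_g = u_b = 0."""
    u_r = u_g = u_b = 0.0
    for _ in range(max_iter):
        new = (r1(u_g * u_b), g1(u_b * u_r), b1(u_r * u_g))
        change = max(abs(new[0] - u_r), abs(new[1] - u_g), abs(new[2] - u_b))
        u_r, u_g, u_b = new
        if change < tol:
            return u_r, u_g, u_b
    raise RuntimeError("iteration did not converge")


def poisson_gf(c):
    """Generating function exp(c (z - 1)) of a Poisson distribution with mean c.
    For Poisson degrees this serves as both r_0 and r_1 (an assumption of
    this example, not a general property)."""
    return lambda z: math.exp(c * (z - 1.0))


c_r, c_g, c_b = 3.0, 10.0, 6.0
u_r, u_g, u_b = solve_giant_component(poisson_gf(c_r), poisson_gf(c_g), poisson_gf(c_b))
S_r = 1.0 - poisson_gf(c_r)(u_g * u_b)    # Eq. (16)
print(u_r, u_g, u_b, S_r)
```

For these mean degrees $u_g u_b$ is tiny, so $S_r$ is very close to $1 - e^{-3}$: almost every red vertex with at least one hyperedge belongs to the giant component.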

We can also derive a condition for the existence of a 
giant component in the network. A giant component exists
if and only if $u_r$, $u_g$, and $u_b$ are all less than 1. (They
must all be less than 1 because an extensive giant component
of vertices of any one color automatically implies
an extensive component of the other two colors, since, 
with only mild conditions on the degree distribution, the 
first color must be connected into a giant component by 
an extensive number of hyperedges, and each hyperedge 
is attached to one vertex of each color.) 

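Consider values of the variables that are only slightly
different from 1 thus:

$$ u_r = 1 - \epsilon_r, \qquad u_g = 1 - \epsilon_g, \qquad u_b = 1 - \epsilon_b, \tag{17} $$

where $\epsilon_r$, $\epsilon_g$, and $\epsilon_b$ are small. Then, from Eq. (14),

$$ \epsilon_r = 1 - u_r = 1 - r_1(u_g u_b) = 1 - r_1(1 - \epsilon_g - \epsilon_b + \epsilon_g \epsilon_b) = (\epsilon_g + \epsilon_b)\,r_1'(1) + O(\epsilon^2), \tag{18} $$

where we have performed a Taylor expansion of $r_1$ and
made use of $r_1(1) = 1$ (which is necessarily true if $q_r(k)$
is a properly normalized distribution). We can derive
similar equations for $\epsilon_g$ and $\epsilon_b$ and combine all three into
the single vector equation

$$ \begin{pmatrix} \epsilon_r \\ \epsilon_g \\ \epsilon_b \end{pmatrix} = \begin{pmatrix} 0 & r & r \\ g & 0 & g \\ b & b & 0 \end{pmatrix} \begin{pmatrix} \epsilon_r \\ \epsilon_g \\ \epsilon_b \end{pmatrix}, \tag{19} $$

where we have introduced the shorthand $r = r_1'(1)$, $g = g_1'(1)$,
and $b = b_1'(1)$.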

If $u_r$, $u_g$, and $u_b$ are to be less than 1, meaning the
corresponding $\epsilon$'s must all be non-zero, then this equation
implies the determinant condition

$$ \begin{vmatrix} -1 & r & r \\ g & -1 & g \\ b & b & -1 \end{vmatrix} = 0, \tag{20} $$

or

$$ 2rgb + rg + gb + br = 1. \tag{21} $$



This condition defines the point at which the phase 
transition takes place. Equivalently, $2rgb + rg + gb + br$
crosses 1 at the transition. In fact it is greater than 1
when there is a giant component and less than 1 when there
is none (rather than the other way around), as can be
shown by exhibiting any example where this is the case.
A suitable example is provided by a network in which
all vertices have degree one, which clearly has no giant
component. This choice makes $r = g = b = 0$ and the
result follows.

Thus our condition for the existence of a giant component
is

$$ 2rgb + rg + gb + br > 1. \tag{22} $$

This is the equivalent of the well known condition of Molloy
and Reed for the existence of a giant component in a
unipartite random graph [12].

An alternative form for this condition can be derived
by making use of Eqs. (6) and (8) to write

$$ r = r_1'(1) = \frac{\langle k^2 \rangle_r - \langle k \rangle_r}{\langle k \rangle_r}, \tag{23} $$

and similarly for $g$ and $b$. Here $\langle \ldots \rangle_r$ indicates an average
over the degree distribution of the red vertices and
$c_r = \langle k \rangle_r$.

Substituting these expressions into (22), we find, after
some algebra, that

$$ \frac{\langle k \rangle_r}{\langle k^2 \rangle_r} + \frac{\langle k \rangle_g}{\langle k^2 \rangle_g} + \frac{\langle k \rangle_b}{\langle k^2 \rangle_b} < 2. \tag{24} $$

This form is particularly pleasing, since it has the same
general shape as the criterion of Molloy and Reed for the
unipartite case, which can be written as $\langle k \rangle / \langle k^2 \rangle < \tfrac{1}{2}$.
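Both forms of the criterion are easy to evaluate from degree sequences. The short Python sketch below computes $r$, $g$, and $b$ from lists of degrees via the moment expression $(\langle k^2 \rangle - \langle k \rangle)/\langle k \rangle$ and checks the product form and the moment form side by side; the function names and the example degree lists are hypothetical.

```python
def branching_ratio(degrees):
    """Return (<k^2> - <k>) / <k> for a list of vertex degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return (mean_k2 - mean_k) / mean_k


def has_giant_component(deg_r, deg_g, deg_b):
    """Product form of the criterion: 2rgb + rg + gb + br > 1."""
    r, g, b = (branching_ratio(d) for d in (deg_r, deg_g, deg_b))
    return 2 * r * g * b + r * g + g * b + b * r > 1


def moment_criterion(deg_r, deg_g, deg_b):
    """Moment form: the sum over colors of <k> / <k^2> is less than 2."""
    total = 0.0
    for d in (deg_r, deg_g, deg_b):
        total += sum(d) / sum(k * k for k in d)
    return total < 2


# Hypothetical degree sequences, taken identical for the three colors.
for seq in ([1, 1, 1], [1, 1, 1, 2], [1, 2, 3]):
    print(seq, has_giant_component(seq, seq, seq), moment_criterion(seq, seq, seq))
```

The all-degree-one sequence reproduces the example used in the text ($r = g = b = 0$, no giant component), and the two forms agree on each of the three sequences.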

E. Other types of components 

The definition of a component used in the previous sec- 
tion is not the only one possible for our tripartite graph. 
In some folksonomies one cannot surf over connections
formed by both users and tags. In some cases, for instance,
one is barred from seeing which resources a particular
user has tagged for privacy reasons, meaning one
can surf between resources with the same tag, but not 
with the same user. In this case we are surfing on the 
network formed by two colors of vertices only, say red 
and green. 

We can approach this situation using the same techniques
as in the previous section. We define probabilities
$u_r$ and $u_g$ as before and find that they satisfy the equations

$$ u_r = r_1(u_g), \qquad u_g = g_1(u_r). \tag{25} $$

Linearizing around the point $u_r = u_g = 1$ we then find
that the transition at which the giant component appears
takes place when

$$ \begin{vmatrix} -1 & r \\ g & -1 \end{vmatrix} = 0, \tag{26} $$

or equivalently $rg = 1$, with $r$ and $g$ defined as before.
By considering appropriate special cases, one can then
show that the giant component exists if and only if $rg > 1$.
Substituting from Eq. (23), we can also write this
condition in the form

$$ \frac{\langle k \rangle_r}{\langle k^2 \rangle_r} + \frac{\langle k \rangle_g}{\langle k^2 \rangle_g} < 1. \tag{27} $$



Note that this expression is not symmetric with respect 
to permutations of the three color indices, as Eq. (24)
was. This means that in general giant components for
different color pairs will appear at different transitions, 
and it is possible to have a giant component for one pair 
without having a giant component for another. Thus for 
instance in our Flickr example one might be able to surf 
the network of photos and tags, but not the network of 
photos and users. (Actually, one can surf both just fine 
in the real Flickr network.) 
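To illustrate how different color pairs can fall on opposite sides of the transition, the sketch below evaluates the two-color criterion for each pair of colors, both in the product form ($rg > 1$ for the red/green pair) and in the equivalent moment form. The degree sequences are hypothetical and were chosen only so that each color has the same total number of stubs.

```python
def branching_ratio(degrees):
    """Return (<k^2> - <k>) / <k> for a list of vertex degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return (mean_k2 - mean_k) / mean_k


def pair_product_criterion(deg_a, deg_b):
    """Two-color condition in product form, e.g. r * g > 1."""
    return branching_ratio(deg_a) * branching_ratio(deg_b) > 1


def pair_moment_criterion(deg_a, deg_b):
    """Two-color condition in moment form: <k>/<k^2> summed over the
    two colors is less than 1."""
    ratio = lambda d: sum(d) / sum(k * k for k in d)
    return ratio(deg_a) + ratio(deg_b) < 1


# Hypothetical degree sequences with six stubs of each color.
red = [1, 1, 4]
green = [1, 2, 3]
blue = [1, 1, 1, 1, 1, 1]

pairs = {"red/green": (red, green), "red/blue": (red, blue), "green/blue": (green, blue)}
for name, (a, b) in pairs.items():
    print(name, pair_product_criterion(a, b), pair_moment_criterion(a, b))
```

Here only the red/green pair satisfies the criterion: every blue vertex has degree one, so its excess degree is always zero and no pair involving blue can form a giant component.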



F. Percolation 

One can also consider percolation processes on tripar- 
tite networks. If some vertices are removed from the 
network then the remaining network may or may not 
percolate, i.e., possess a giant component. For exam- 
ple, on the Flickr web site users can designate photos 
as publicly viewable or not, and those that are not are, 
for all intents and purposes, removed from the network. 
One cannot use them, for instance, for surfing across 
the network. There are many ways in which vertices 
might be removed, but as a simple example let us as- 
sume that vertices of only one kind are removed and 
make the standard percolation assumption that they are 
removed uniformly at random. (More complicated per- 
colation schemes are certainly possible, with more than 
one type of vertex removed, different probabilities of re- 
moval for different types, or nonuniform removal, and all 



of these schemes can be studied by methods similar to 
those outlined here.) 

Suppose a fraction $\phi$ of the red vertices in our network
are present (or functional) and $1 - \phi$ are removed (or
nonfunctional). In the language of percolation theory, a
fraction $\phi$ of the vertices are occupied. Then define $u_r$ as
before to be the probability that the red vertex attached
to a random hyperedge does not belong to the giant component,
or the giant cluster as it is more commonly called
in the percolation context. There are two different ways
in which this can happen. If the vertex itself has been
removed, then it does not belong to the giant cluster. Alternatively,
it may be present but, as before, none of its
neighbors, either blue or green, are in the giant cluster.
This allows us to write down an expression for $u_r$ thus:

$$ u_r = 1 - \phi + \phi\,r_1(u_g u_b). \tag{28} $$



The corresponding expressions for $u_g$ and $u_b$ are the
same as in our previous calculation, $u_g = g_1(u_b u_r)$,
$u_b = b_1(u_r u_g)$, and the fractions of red, green, and blue
vertices in the giant percolation cluster are

$$ S_r = \phi\,[1 - r_0(u_g u_b)], \tag{29a} $$
$$ S_g = 1 - g_0(u_b u_r), \tag{29b} $$
$$ S_b = 1 - b_0(u_r u_g). \tag{29c} $$



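We can also calculate an expression for the value of $\phi$
at which the percolation transition happens. As before
we perturb around the point $u_r = u_g = u_b = 1$
that corresponds to no giant cluster and the equivalent
of Eq. (19) is

$$ \begin{pmatrix} \epsilon_r \\ \epsilon_g \\ \epsilon_b \end{pmatrix} = \begin{pmatrix} 0 & \phi r & \phi r \\ g & 0 & g \\ b & b & 0 \end{pmatrix} \begin{pmatrix} \epsilon_r \\ \epsilon_g \\ \epsilon_b \end{pmatrix}, \tag{30} $$

with $r$, $g$, and $b$ defined as before. This implies that the
transition happens at $\phi = \phi_c$ where $\phi_c$ is the solution of
$2\phi rgb + \phi rg + gb + \phi br = 1$. That is,

$$ \phi_c = \frac{1 - gb}{r\,(2gb + g + b)}. \tag{31} $$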



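Making use of Eq. (23) and the corresponding expressions
for $g$ and $b$ we then find that

$$ \phi_c = \left[ \frac{\langle k^2 \rangle_r}{\langle k \rangle_r} - 1 \right]^{-1} \left\{ \left[ 2 - \frac{\langle k \rangle_g}{\langle k^2 \rangle_g} - \frac{\langle k \rangle_b}{\langle k^2 \rangle_b} \right]^{-1} - 1 \right\}. \tag{32} $$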
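The percolation threshold and the size of the giant cluster can be checked numerically with a short sketch like the one below. It assumes Poisson degree distributions, so that all generating functions are $e^{c(z-1)}$ and $r = c_r$, $g = c_g$, $b = c_b$; the function names, the parameter values, and the convergence settings are hypothetical choices for illustration.

```python
import math


def critical_occupation(r, g, b):
    """Critical occupation probability phi_c = (1 - g b) / (r (2 g b + g + b))."""
    return (1.0 - g * b) / (r * (2.0 * g * b + g + b))


def red_cluster_fraction(phi, c_r, c_g, c_b, tol=1e-12, max_iter=100_000):
    """Iterate u_r = 1 - phi + phi r_1(u_g u_b), u_g = g_1(u_b u_r),
    u_b = b_1(u_r u_g) for Poisson degrees, starting from zero, and
    return S_r = phi [1 - r_0(u_g u_b)]."""
    gf = lambda c, z: math.exp(c * (z - 1.0))   # Poisson generating function
    u_r = u_g = u_b = 0.0
    for _ in range(max_iter):
        new_r = 1.0 - phi + phi * gf(c_r, u_g * u_b)
        new_g = gf(c_g, u_b * u_r)
        new_b = gf(c_b, u_r * u_g)
        change = max(abs(new_r - u_r), abs(new_g - u_g), abs(new_b - u_b))
        u_r, u_g, u_b = new_r, new_g, new_b
        if change < tol:
            break
    return phi * (1.0 - gf(c_r, u_g * u_b))


c_r, c_g, c_b = 4.0, 0.5, 0.5
phi_c = critical_occupation(c_r, c_g, c_b)
print("phi_c =", phi_c)
for phi in (0.1, 0.5):
    print(phi, red_cluster_fraction(phi, c_r, c_g, c_b))
```

With these values $\phi_c = 0.125$, and the computed cluster fraction is essentially zero at $\phi = 0.1$ but substantial at $\phi = 0.5$. If $gb > 1$ the formula returns a negative value, which reflects the fact that green and blue vertices remain linked through hyperedges whose red vertex has been removed, so a giant cluster is present for every $\phi$.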



G. Simulations 



Before looking at real-world tripartite networks, we 
first compare our calculations with simulation results for 
computer-generated random graphs. 
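Consider a tripartite random graph with Poisson degree
distributions thus:

$$p_r(k) = e^{-c_r}\frac{c_r^k}{k!}, \qquad p_g(k) = e^{-c_g}\frac{c_g^k}{k!}, \qquad p_b(k) = e^{-c_b}\frac{c_b^k}{k!}, \tag{33}$$

where the average degrees $c_r$, $c_g$, and $c_b$ satisfy the earlier
constraint that every vertex type carries the same total number of
hyperedge ends, $n_r c_r = n_g c_g = n_b c_b$.
The corresponding generating functions are

$$r_0(z) = r_1(z) = e^{-c_r}\sum_{k=0}^{\infty} \frac{c_r^k}{k!}\, z^k = e^{c_r(z-1)},$$
$$g_0(z) = g_1(z) = e^{-c_g}\sum_{k=0}^{\infty} \frac{c_g^k}{k!}\, z^k = e^{c_g(z-1)},$$
$$b_0(z) = b_1(z) = e^{-c_b}\sum_{k=0}^{\infty} \frac{c_b^k}{k!}\, z^k = e^{c_b(z-1)}. \tag{34}$$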

We can use these to calculate, for instance, the degree
distribution of the projection of the network onto the
red vertices in which two vertices are connected if they
share either a green or a blue neighbor. The generating
function for this distribution is given by Eq. (12) to be

$$R_{gb}(z) = r_0\bigl(g_1(z)\, b_1(z)\bigr) = \exp\Bigl[c_r\bigl(e^{(c_g + c_b)(z-1)} - 1\bigr)\Bigr]. \tag{35}$$

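Expanding in powers of $z$, we then find that the probability
$p_{gb}(k)$ of a red vertex having exactly $k$ neighbors
in the projected network is

$$p_{gb}(k) = e^{c_r\left(e^{-(c_g + c_b)} - 1\right)}\, \frac{(c_g + c_b)^k}{k!} \sum_{m=0}^{k} \left\{ {k \atop m} \right\} \left[c_r\, e^{-(c_g + c_b)}\right]^m, \tag{36}$$

where $\left\{ {k \atop m} \right\}$ is a Stirling number of the second kind,
i.e., the number of ways of dividing $k$ objects into $m$
nonempty sets [15].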
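The expansion of $R_{gb}(z)$ in Stirling numbers can be checked numerically. The sketch below evaluates the coefficient of $z^k$ in $\exp[c_r(e^{(c_g+c_b)(z-1)} - 1)]$ using the standard recurrence for Stirling numbers of the second kind; the function names and the small parameter values are illustrative assumptions, chosen so that plain floating-point arithmetic suffices.

```python
import math

def stirling2_table(kmax):
    """Exact Stirling numbers of the second kind S[k][m] for 0 <= m <= k <= kmax."""
    S = [[0] * (kmax + 1) for _ in range(kmax + 1)]
    S[0][0] = 1
    for k in range(1, kmax + 1):
        for m in range(1, k + 1):
            S[k][m] = m * S[k - 1][m] + S[k - 1][m - 1]
    return S

def projected_degree_distribution(c_r, c_g, c_b, kmax):
    """Coefficients of z^k in exp[c_r (exp((c_g + c_b)(z - 1)) - 1)] for k <= kmax."""
    a = c_g + c_b
    lam = c_r * math.exp(-a)
    prefactor = math.exp(c_r * (math.exp(-a) - 1.0))
    S = stirling2_table(kmax)
    dist = []
    for k in range(kmax + 1):
        # Touchard-type sum over the number of nonempty sets m.
        inner = sum(S[k][m] * lam**m for m in range(k + 1))
        dist.append(prefactor * a**k / math.factorial(k) * inner)
    return dist

if __name__ == "__main__":
    p = projected_degree_distribution(c_r=1.5, c_g=1.0, c_b=0.5, kmax=60)
    print("normalization:", sum(p))
    print("mean degree:  ", sum(k * pk for k, pk in enumerate(p)))
```

For larger mean degrees the Stirling numbers grow very quickly, so exact rational arithmetic or logarithms would be a safer choice than floats.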

The main panel of Fig. 4 shows the form of this distribution
for the case $c_r = 3$, $c_g = 10$, $c_b = 6$. In the same
plot we show the results of simulations in which random
tripartite graphs with the same degree distributions and
$n_r = 100\,000$, $n_g = 30\,000$, and $n_b = 50\,000$ were generated
and then explicitly projected onto the red vertices
and the resulting degree distribution measured directly.
As the figure shows, the agreement between the two is
excellent.
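A simulation of this kind can be sketched as follows. This is not the authors' code: it assumes that placing each hyperedge on a uniformly random red, green, and blue vertex is an acceptable way to obtain approximately Poisson degrees, and it uses much smaller networks and mean degrees than those quoted above so that it runs quickly in pure Python. The function name and parameters are hypothetical.

```python
import random

def simulate_projected_degrees(n_r, n_g, n_b, n_edges, seed=0):
    """Build a random tripartite hypergraph and project it onto the red vertices.

    Each hyperedge joins one uniformly random red, green, and blue vertex,
    which gives approximately Poisson degrees with means n_edges / n_r, etc.
    Returns the list of projected degrees of the red vertices.
    """
    rng = random.Random(seed)
    edges_of_red = [[] for _ in range(n_r)]
    reds_of_green = [[] for _ in range(n_g)]
    reds_of_blue = [[] for _ in range(n_b)]
    for _ in range(n_edges):
        r, g, b = rng.randrange(n_r), rng.randrange(n_g), rng.randrange(n_b)
        edges_of_red[r].append((g, b))
        reds_of_green[g].append(r)
        reds_of_blue[b].append(r)

    degrees = []
    for r in range(n_r):
        neighbors = set()
        for g, b in edges_of_red[r]:
            # Red vertices sharing either the green or the blue vertex.
            neighbors.update(reds_of_green[g])
            neighbors.update(reds_of_blue[b])
        neighbors.discard(r)
        degrees.append(len(neighbors))
    return degrees

if __name__ == "__main__":
    # c_r = 1.5, c_g = 1.0, c_b = 0.5, so n_r c_r = n_g c_g = n_b c_b = 3000.
    deg = simulate_projected_degrees(2000, 3000, 6000, 3000, seed=1)
    print("mean projected degree:", sum(deg) / len(deg))
    print("theory c_r (c_g + c_b):", 1.5 * (1.0 + 0.5))
```

The measured histogram of `deg` can then be compared point by point with the exact distribution.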

The inset of Fig. 4 shows the size of the giant cluster for
percolation on the red vertices of the same network as a
function of the occupation probability $\phi$, calculated both
by numerical solution of Eqs. (28) and (29) and by direct
measurement on simulated networks. Again the agreement
is excellent.

IV. COMPARISON WITH REAL-WORLD DATA

In this section we compare the predictions of our tri- 
partite random graph model against data for the folkson- 
omy of the Flickr photo-sharing web site. As we show, 




FIG. 4: The degree distribution for the projection of our Pois- 
son hypergraph onto its red vertices alone, in which two red 
vertices are joined by an edge if they have either a green or a 
blue neighbor in common on the original tripartite network. 
The solid line is the exact solution, Eq. (36), and the points
are the results of numerical simulations averaged over a hun- 
dred realizations of the network. The error bars are smaller 
than the size of the points in all cases. Inset: The fraction of 
red vertices belonging to the giant percolation cluster for site 
percolation on the tripartite network, as a function of occupation
probability $\phi$. The solid line is the exact solution and
the points are the results of numerical simulations. 



the theory and empirical observations agree well in some 
respects, but less well in others. In many ways the dis- 
crepancies are at least as interesting as the cases of agree- 
ment, since they indicate situations in which the struc- 
ture of the observed network cannot be explained by a 
simple random model that ignores social and other ef- 
fects. When data and model disagree it is a sign that 
these effects are important in determining the network 
structure. Thus, as with other random graph models, 
one of the most significant roles our model can play may 
be as a null model that allows the experimenter to deter- 
mine when a network is doing something nontrivial. 

Our example data set represents the folksonomy net- 
work of 266 198 photos added to the Flickr web site by its 
users during 2007, along with the tags applied to those 
photos and the users who applied them. The first step 
in analyzing the data is to measure the three degree dis- 
tributions for the three types of vertices. The degree 
distributions are shown in Fig. 5. As is common in most
social networks, they are highly right-skewed, meaning 
there are many vertices of low degree and a small num- 
ber of very high degree, although the distributions do 
not follow power-law forms as the distributions in some 
networks do. Using these distributions, we can construct
the corresponding generating functions from their earlier
definitions; these are simple polynomials (albeit of high
order) that can be easily evaluated numerically.
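As an illustration of how such polynomial generating functions might be built from data, the sketch below constructs them from a list of observed degrees. It assumes the usual definitions, $f_0(z) = \sum_k p_k z^k$ for the degree distribution and $f_1(z) = f_0'(z)/f_0'(1)$ for the excess degree distribution; the function name and the toy degree list are hypothetical.

```python
from collections import Counter

def degree_generating_functions(degrees):
    """Return (f0, f1) as Python callables built from an observed degree list.

    f0(z) = sum_k p_k z^k               generates the degree distribution;
    f1(z) = sum_k k p_k z^(k-1) / <k>   generates the excess degree distribution.
    """
    n = len(degrees)
    p = {k: count / n for k, count in Counter(degrees).items()}
    mean = sum(k * pk for k, pk in p.items())

    def f0(z):
        return sum(pk * z**k for k, pk in p.items())

    def f1(z):
        return sum(k * pk * z**(k - 1) for k, pk in p.items() if k > 0) / mean

    return f0, f1

if __name__ == "__main__":
    # Toy degree sequence standing in for one vertex type of a real data set.
    g0, g1 = degree_generating_functions([1, 1, 2, 3, 5])
    print(g0(1.0), g1(1.0), g0(0.5), g1(0.5))
```

Given such callables for each vertex type, a projected generating function of the form $r_0(g_1(z)\,b_1(z))$ is just a nested function call.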

We can use our generating functions to calculate, for 
example, the generating functions $R_{gb}(z)$ and so forth for



FIG. 5: The three degree distributions of the tripartite Flickr 
folksonomy network for photos (red), tags (green), and users 
(blue). 



the degree distributions of the projections of the network
onto one vertex type, using Eqs. (10) and (12) and their
equivalents for other vertex types. Again these functions
can be rapidly evaluated for any argument $z$ numerically.
The degree distributions themselves are then given by
derivatives of the generating functions thus:

$$p_k = \frac{1}{k!} \left.\frac{d^k R_{gb}}{dz^k}\right|_{z=0}. \tag{37}$$



Direct numerical evaluation of derivatives is plagued by
problems with noise and should be avoided, but one can
get good results by instead employing Cauchy's integral
formula for the $k$th derivative of a function:

$$\left.\frac{d^k f}{dz^k}\right|_{z=z_0} = \frac{k!}{2\pi i} \oint \frac{f(z)}{(z - z_0)^{k+1}}\,dz, \tag{38}$$



where the integral is around a contour enclosing the point
$z_0$ but excluding any poles of $f(z)$. Applying this formula
to Eq. (37) we get

$$p_k = \frac{1}{2\pi i} \oint \frac{R_{gb}(z)}{z^{k+1}}\,dz. \tag{39}$$



We then calculate the degree distribution by performing
the contour integral numerically around a suitable contour
(the unit circle $|z| = 1$ works well). One can without
difficulty calculate to good precision the first thousand or
so coefficients of the generating function in this fashion.
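A minimal sketch of this numerical contour integration is given below. Parametrizing the unit circle as $z = e^{i\theta}$ turns Eq. (39) into $p_k = (1/2\pi)\int_0^{2\pi} R_{gb}(e^{i\theta})\, e^{-ik\theta}\, d\theta$, which the code approximates by the trapezoidal rule on equally spaced points. The function name and the number of sample points are illustrative assumptions, and a Poisson generating function is used as a stand-in for $R_{gb}$ so that the result can be checked against a known answer.

```python
import cmath
import math

def coefficients_by_contour(f, kmax, n_points=256):
    """Approximate the Taylor coefficients p_0..p_kmax of f about z = 0.

    Uses the trapezoidal rule on the unit circle, which for an analytic f
    amounts to a discrete Fourier transform of the sampled values.
    """
    samples = [f(cmath.exp(2j * math.pi * j / n_points)) for j in range(n_points)]
    coeffs = []
    for k in range(kmax + 1):
        total = sum(
            s * cmath.exp(-2j * math.pi * j * k / n_points)
            for j, s in enumerate(samples)
        )
        coeffs.append((total / n_points).real)
    return coeffs

if __name__ == "__main__":
    c = 3.0
    poisson_gf = lambda z: cmath.exp(c * (z - 1.0))
    approx = coefficients_by_contour(poisson_gf, kmax=10)
    exact = [math.exp(-c) * c**k / math.factorial(k) for k in range(11)]
    print(max(abs(a - e) for a, e in zip(approx, exact)))
```

The number of sample points should comfortably exceed the largest $k$ of interest, since coefficients separated by `n_points` are aliased onto one another; for many coefficients a fast Fourier transform would be the more efficient choice.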

We have performed this calculation using the degree 
distributions of the Flickr network and projecting onto 
the resources, i.e., the photos. Figure 6 shows a comparison
of the results with the degree distribution for the actual
projected network. The upper solid line in the figure
represents the theoretical result, while the circles represent
the measurements. Although the two curves have
the same general shape, it is clear from the figure that



FIG. 6: Circles show the cumulative distribution function for 
the degree distribution of the projection of the Flickr network 
onto its photograph vertices, while the upper solid line shows 
the predictions of the random graph model for the same quan- 
tity. Squares show the same function after pruning of the data 
to remove multiple tagging as described in the text and the 
lower solid curve shows the corresponding model prediction, 
recalculated from the new degree distributions after pruning. 



the agreement between them is only moderately good in 
this case. Upon closer inspection, however, it turns out 
that there is a relatively simple reason for this. 

As discussed in Section III A, our random graph model
assumes a locally tree-like structure for the tripartite network,
a structure with no short loops. The Flickr network,
on the other hand, turns out to have many short
loops, which is why empirical measurements and model
do not agree in Fig. 6. As we now show, however, the
loops in the Flickr network are primarily of a trivial kind
that can easily be allowed for in the calculations.

Typically, photos are not added to the Flickr network
individually, but in sets. The most common practice
is for a user to upload a set of photos on a particular
subject (say, pictures of a Ferrari motor car) and then
label all of the photos in the set with the same set of
tags: Ferrari, automobile, sports car, and so forth. This
creates short loops between photos in the set of the form
$P_1 \to T_1 \to P_2 \to T_2 \to P_1$, where the $P$s are the photos
and the $T$s are tags. These loops have an adverse
effect on the calculation of the number of neighbors a
photo has in the projected network, since in many cases
two projected edges from a photo will lead to the same
neighboring photo, rather than to different neighbors,
and hence give a lower degree in the projected network
than our naive random graph calculation predicts.
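The double counting can be seen in a toy example. The sketch below considers only tags (users are ignored for simplicity) and compares the number of photo-tag-photo paths leaving a photo, which a tree-like calculation implicitly treats as leading to separate neighbors, with the number of distinct neighboring photos. The data and function name are made up for illustration.

```python
def projected_neighbor_counts(tags_of_photo, photo):
    """Return (paths, distinct) for one photo in the photo projection.

    paths    counts every photo-tag-photo path leaving `photo`;
    distinct counts the neighboring photos actually reached.
    """
    paths = 0
    distinct = set()
    for tag in tags_of_photo[photo]:
        for other, other_tags in tags_of_photo.items():
            if other != photo and tag in other_tags:
                paths += 1
                distinct.add(other)
    return paths, len(distinct)

if __name__ == "__main__":
    # P1 and P2 belong to one set and share both tags, forming the loop
    # P1 -> T1 -> P2 -> T2 -> P1; P3 is unrelated.
    photos = {"P1": {"T1", "T2"}, "P2": {"T1", "T2"}, "P3": {"T3"}}
    print(projected_neighbor_counts(photos, "P1"))  # (2, 1)
```

Here two paths lead to the same photo, so the true projected degree of P1 is 1 rather than 2.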

To test the effect of these "trivial" loops in the net- 
work structure, we have pruned the data set to remove 
instances of multiple tagging. In the pruned data set the 
application by a user of many tags to the same photo is 
represented by just a single hyperedge, rather than many. 
In this representation, hyperedges represent the act of 
FIG. 7: Cumulative distribution functions for the degree
distributions of the projection of the Flickr network onto its user
vertices, both before and after pruning of the data. The points 
represent the observations, unpruned (circles) and pruned 
(squares), while the solid lines represent the predictions of 
the model. 

tagging a photo, rather than a specific tag, and only one 
hyperedge is included between a user and a photo no 
matter how many tags the user applies. Similarly we 
also represent the tagging of many photos with the same 
tag by a single hyperedge, so that hyperedges represent 
the act of tagging an entire photo set, rather than just a 
single photo. This should remove most instances of triv- 
ial loops in the projected network of the type described 
above. 
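One possible way to carry out such a pruning is sketched below. The text does not specify which hyperedge is retained when several are merged, so this sketch makes the assumption, purely for illustration, that the first one encountered is kept; the function name and the (user, tag, photo) triple format are likewise hypothetical.

```python
def prune_multiple_tagging(hyperedges):
    """Collapse multiple tagging in a list of (user, tag, photo) hyperedges.

    Keeps at most one hyperedge per (user, photo) pair and at most one per
    (user, tag) pair. Which hyperedge survives is an arbitrary choice made
    here: the first one encountered in the input order.
    """
    seen_user_photo = set()
    seen_user_tag = set()
    pruned = []
    for user, tag, photo in hyperedges:
        if (user, photo) in seen_user_photo or (user, tag) in seen_user_tag:
            continue
        seen_user_photo.add((user, photo))
        seen_user_tag.add((user, tag))
        pruned.append((user, tag, photo))
    return pruned

if __name__ == "__main__":
    # One user applies the same two tags to both photos of a set.
    raw = [
        ("u1", "ferrari", "p1"),
        ("u1", "car", "p1"),
        ("u1", "ferrari", "p2"),
        ("u1", "car", "p2"),
    ]
    print(prune_multiple_tagging(raw))
```

In this example the four raw hyperedges reduce to two, and the two photos no longer share a tag, although they remain linked through their common user.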

Now we calculate again the projection of the hypergraph
onto the set of photos. We also recalculate the
theoretical predictions to reflect the changed degree
distributions of the hypergraph following pruning. The results
are shown in Fig. 6 (squares and lower solid curve)
and, as the figure shows, the agreement is now quite good
between theory and observation. This suggests that the
earlier disagreement between the two is indeed primarily
a result of the presence of the loops in the hypergraph
introduced by the practice of multiple tagging.

We can perform similar calculations for projections 
onto other types of vertices. In Fig. 7 we show degree
distributions, before and after pruning of the data set, 
for the projection onto users. Agreement between the- 
ory and observation for the unpruned data is again quite 
poor in this case but significantly better for the pruned 
data. 

These calculations provide, in many ways, a good ex- 
ample of the utility of random graph models. When com- 
pared with the raw data from the Flickr network, our 
random graph model agrees qualitatively, but not quan- 
titatively, indicating that there are effects present in the 
network that are not accounted for by simple random 



hyperedges. On the other hand, once one prunes the 
data to remove multiple tagging, the agreement becomes 
much better, suggesting that multiple tagging is the pri- 
mary nonrandom behavior taking place in the network 
and that in other respects the network is in fact quite 
close to being a random graph. Thus the model allows 
us not only to say when the network deviates from the 
random assumption, but also the particular nature of the 
deviation. 



V. CONCLUSIONS 

Motivated by the emergence of new types of social net- 
works, such as folksonomies, we have in this paper pro- 
posed and studied a model of random tripartite hyper- 
graphs. We have defined basic network measures, such 
as degree distributions and projections onto individual 
vertex types, and calculated a variety of statistical prop- 
erties of the model in the limit of large network size. 
Among other things we have calculated the explicit de- 
gree distributions for projected networks, conditions for 
the emergence of a giant component, the size of the gi- 
ant component when there is one, and the location of 
the percolation threshold for site percolation on the net- 
work. In principle, the techniques introduced could be 
extended to hypergraphs with more vertex types or ad- 
ditional types of edges, although we have not pursued 
any such extensions here. 

We have compared our results against measurements 
of computer-generated random hypergraphs and a real- 
world tripartite network, the folksonomy of the on-line 
photo-sharing web site Flickr. In the latter case, we 
have focused on the degree distributions of projections 
of the hypergraph onto one vertex type and find that 
in some instances the theory makes predictions in mod- 
erately good agreement with the observations while in 
others the agreement is poorer. In all cases, however, we 
find that agreement becomes significantly better when we 
remove instances of multiple tagging from the network — 
instances in which a user applies many tags to the same 
photo or the same tag to many photos — suggesting that 
the disagreement is primarily a result of relatively triv- 
ial structures in the network, rather than more subtle or 
large-scale social network effects. 